6 Applications in Computer Vision

6.1 Introduction

In this chapter, we introduce applications of binary neural networks in computer vision and related fields. Specifically, we cover tasks including person re-identification, 3D point cloud processing, object detection, and speech recognition. First, we briefly overview these areas.

6.1.1 Person Re-Identification

A large family of person re-id research focuses on metric learning losses. Some methods combine a verification loss [248] with the identification loss, while others apply a triplet loss with hard-sample mining [41, 203]. Recent efforts employ pedestrian attributes to strengthen supervision and enable multi-task learning [213, 232]. One mainstream approach horizontally splits input images or feature maps to exploit local spatial cues [132, 219, 271]. Similarly, pose estimation has been incorporated into the learning of local features [212, 214]. Furthermore, human parsing is used in [111] to enhance spatial matching. In comparison, our DG-Net relies only on a simple identification loss for re-id learning and requires no extra auxiliary information, such as pose or human parsing, for image generation.
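The horizontal-splitting idea above can be illustrated with a minimal sketch: a convolutional feature map is cut into horizontal stripes, and each stripe is pooled into its own part-level descriptor. This is only an assumption-laden toy (NumPy in place of a deep learning framework, average pooling, six parts as in typical part-based models); real part-based re-id models additionally learn a classifier per part.

```python
import numpy as np

def part_pooling(feature_map, num_parts=6):
    """Split a C x H x W feature map into horizontal stripes and
    average-pool each stripe into one part-level descriptor.
    Illustrative sketch only, not any specific published model."""
    c, h, w = feature_map.shape
    assert h % num_parts == 0, "height must divide evenly in this sketch"
    # group rows into num_parts stripes of equal height
    stripes = feature_map.reshape(c, num_parts, h // num_parts, w)
    # average over each stripe's spatial extent -> (num_parts, C)
    return stripes.mean(axis=(2, 3)).T

feats = part_pooling(np.random.rand(256, 24, 8), num_parts=6)
print(feats.shape)  # (6, 256)
```

Each of the six 256-d rows then serves as a local descriptor for one horizontal body region, which is how local spatial cues enter the matching.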

Another active line of research uses GANs [76] to augment training data. The method in [294] is the first to use an unconditional GAN to generate pedestrian images from random vectors. Huang et al. proceed in this direction with WGAN [4] and assign pseudo-labels to the generated images [95]. Li et al. propose sharing weights between the re-id model and the discriminator of the GAN [76]. In addition, some recent methods use pose estimation to generate pose-conditioned images. In [103], a two-stage generation pipeline refines the generated images based on pose. Similarly, pose is used in [71] to generate images of a pedestrian in different poses, making the learned features more robust to pose variation.

Meanwhile, some recent studies exploit synthetic data for style transfer of pedestrian images to compensate for the disparity between source and target domains. CycleGAN [300] is applied in [296] to transfer the style of pedestrian images from one dataset to another. StarGAN [44] is used in [295] to generate pedestrian images in different camera styles. Bak et al. [7] employ a game engine to render pedestrians under various illumination conditions. Wei et al. [241] use semantic segmentation to extract foreground masks that assist the style transfer.

6.1.2 3D Point Cloud Processing

PointNet [192] is the first deep learning model that directly processes point clouds. The basic building blocks proposed by PointNet, such as multi-layer perceptrons for point-wise feature extraction and max/average pooling for global aggregation, have become popular design choices for many newer backbones. PointNet++ [193] exploits the met-
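The two building blocks named above, a shared point-wise MLP followed by a symmetric pooling operation, can be sketched as follows. This is a hedged NumPy toy with randomly initialized weights, not PointNet's actual architecture (which stacks more layers and transform networks); it only demonstrates why max pooling makes the global feature invariant to the ordering of input points.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, w1, w2):
    """Apply the same two-layer MLP (with ReLU) to every point
    independently: point-wise feature extraction."""
    h = np.maximum(points @ w1, 0.0)
    return np.maximum(h @ w2, 0.0)

def global_feature(points, w1, w2):
    """Point-wise MLP followed by max pooling over the point axis.
    Max is a symmetric function, so the output does not depend on
    the order in which points are listed."""
    return shared_mlp(points, w1, w2).max(axis=0)

pts = rng.normal(size=(1024, 3))           # N x 3 point cloud
w1 = rng.normal(size=(3, 64))              # toy random weights
w2 = rng.normal(size=(64, 128))
g = global_feature(pts, w1, w2)            # 128-d global descriptor
# permutation invariance: shuffling the points leaves g unchanged
g_shuf = global_feature(pts[rng.permutation(1024)], w1, w2)
assert np.allclose(g, g_shuf)
```

The permutation check at the end is the key property: because pooling aggregates over an unordered set, any point ordering yields the same global descriptor.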

DOI: 10.1201/9781003376132-6
